Multilingual BERT Based Word Alignment By Incorporating Common Chinese Characters


Abstract

Word alignment is an important task of detecting translation equivalents between a sentence pair. Although word alignment is no longer necessarily needed for neural machine translation (NMT), it’s still useful in a wealth of applications, e.g., bilingual lexicon induction, constraint decoding, and so on. However, the most well-known word aligners are Giza++ and fastAlign, both of which are implementations of traditional IBM models. To keep pace with the advance of NMT, there has been a surge of interest in replacing the IBM models with neural models. We follow this trend but aim to boost the word alignment performance between Japanese and Chinese, which share a large portion of Chinese characters. Our key idea is to leverage these common Chinese characters between the two languages as an indicator for inferring word alignment; i.e., source and target words that share common Chinese characters should be more likely to be aligned. Following this idea, we propose three methods that incorporate common Chinese characters into mBERT-based word alignment, including reward factor, representation alignment, and contrastive training. Furthermore, we annotate and release a golden dataset for Japanese-Chinese word alignment. Experiments on the dataset show that our methods outperform several strong baselines in terms of AER score and verify the effectiveness of exploiting common Chinese characters.
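The idea of using shared characters as an alignment indicator, and the AER metric used for evaluation, can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names (`is_han`, `shared_han`, `add_reward`, `aer`), the additive form of the reward, the value `reward=0.1`, and the toy similarity matrix are all assumptions made for illustration. Characters are matched by exact code point only, so Japanese and Chinese variant forms of the same character (for example 経 and 经) are not treated as common here. The `aer` function follows the standard definition, 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where S is the set of sure links and P the set of possible links.

```python
def is_han(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def shared_han(src_word, tgt_word):
    """Han characters present in both words (exact code point match only)."""
    return {c for c in src_word if is_han(c)} & set(tgt_word)


def add_reward(sim, src_words, tgt_words, reward=0.1):
    """Return a copy of `sim` with `reward` added for word pairs sharing a Han character.

    Assumption: an additive bonus on a source-by-target similarity matrix.
    The paper's actual reward factor may be defined differently.
    """
    return [
        [score + (reward if shared_han(s, t) else 0.0)
         for score, t in zip(row, tgt_words)]
        for row, s in zip(sim, src_words)
    ]


def aer(predicted, sure, possible):
    """Alignment error rate over sets of (source_index, target_index) links."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links are always counted as possible links
    if not a and not s:
        return 0.0  # nothing predicted and nothing expected
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy example: a Japanese and a Chinese sentence, already segmented into words.
src_words = ["私", "は", "学生", "です"]
tgt_words = ["我", "是", "学生"]
# Hypothetical similarity scores (rows: source words, columns: target words).
sim = [
    [0.6, 0.2, 0.1],
    [0.2, 0.5, 0.1],
    [0.1, 0.1, 0.7],
    [0.1, 0.4, 0.2],
]
rewarded = add_reward(sim, src_words, tgt_words)
```

In this toy input only the pair 学生 / 学生 shares Han characters, so only that cell of the matrix receives the bonus. Handling variant character forms would need an additional mapping table, which this sketch leaves out.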


Related resources

Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information

We describe a method to detect common Chinese characters between Japanese and Chinese automatically by means of freely available resources and verify the effectiveness of the detecting method. We use a joint phrase alignment model on dependency trees and report results of experiments aimed at improving the alignment quality between Japanese and Chinese by incorporating the common Chinese charac...


Word Order Typology through Multilingual Word Alignment

With massively parallel corpora of hundreds or thousands of translations of the same text, it is possible to automatically perform typological studies of language structure using very large language samples. We investigate the domain of word order using multilingual word alignment and high-precision annotation transfer in a corpus with 1144 translations in 986 languages of the New Testament. Re...


Chinese Word Segmentation by Classification of Characters

During the process of Chinese word segmentation, two main problems occur: segmentation ambiguities and unknown word occurrences. This paper describes a method to solve the segmentation problem. First, we use a dictionary-based approach to segment the text. We apply the Maximum Matching algorithm to segment the text forwards (FMM) and backwards (BMM). Based on the difference between FMM and BMM,...


Improving Word Alignment by Adjusting Chinese Word Segmentation

Most current Chinese word alignment tasks first apply a word segmentation system to identify words. However, word-mismatching problems exist between languages and degrade the performance of word alignment. In this paper, we propose two unsupervised methods that adjust word segmentation so that as many tokens as possible map one-to-one between the corresponding sentences. Th...


Language comparison through sparse multilingual word alignment

In this paper, we propose a novel approach to compare languages on the basis of parallel texts. Instead of using word lists or abstract grammatical characteristics to infer (phylogenetic) relationships, we use multilingual alignments of words in sentences to establish measures of language similarity. To this end, we introduce a new method to quickly infer a multilingual alignment of words, usin...


Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2023

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3594634